This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We specifically address three issues: (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with a single model. Most learned video codecs operate internally in the RGB domain for P-frame coding, and B-frame coding for YUV 4:2:0 content remains largely under-explored. In addition, while there has been prior work on variable-rate coding with conditional convolution, most of it fails to take the content information into account. We build our scheme on conditional augmented normalizing flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components additionally conditioned on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames, to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.
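The abstract does not spell out how the modulation is implemented. As a rough sketch of the idea, the PyTorch layer below scales and shifts convolutional features using a content descriptor and a one-hot coding level; all names and shapes are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of content-and-level-adaptive feature modulation
# (hypothetical; layer and variable names are illustrative).
import torch
import torch.nn as nn

class AdaptiveFeatureModulation(nn.Module):
    """Scales and shifts conv features, conditioned on a content
    descriptor and the B-frame coding level (one-hot)."""
    def __init__(self, channels: int, num_levels: int, content_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Maps (content, level) to per-channel affine parameters.
        self.affine = nn.Linear(content_dim + num_levels, 2 * channels)

    def forward(self, x, content, level_onehot):
        y = self.conv(x)
        gamma, beta = self.affine(
            torch.cat([content, level_onehot], dim=1)).chunk(2, dim=1)
        return y * gamma.unsqueeze(-1).unsqueeze(-1) + beta.unsqueeze(-1).unsqueeze(-1)

# One model serves all rates/levels by varying the conditioning inputs.
layer = AdaptiveFeatureModulation(channels=64, num_levels=4, content_dim=16)
out = layer(torch.randn(1, 64, 32, 32), torch.randn(1, 16),
            torch.eye(4)[[2]])  # a level-2 B-frame
```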
This paper presents a reinforcement learning (RL) framework that leverages Frank-Wolfe policy optimization to address coding-tree-unit (CTU) bit allocation for region-of-interest (ROI) intra-frame coding. Most previous RL-based approaches adopt a single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyperparameter. Recently, a dual-critic design was proposed that updates the actor by alternating between the rate and distortion critics, but its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO), formulating CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible action set. With this feasible set, the distortion critic is invoked to update the actor so as to maximize the ROI-weighted image quality subject to the rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method over the other baselines.
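To make the action-constrained update concrete, here is a schematic Frank-Wolfe step in PyTorch. The simplex feasible set and the dummy critic are stand-ins for what the rate and distortion critics would provide; this paraphrases the idea, not the authors' implementation.

```python
# Schematic NFWPO-style reference-action computation (a paraphrase).
import torch

def frank_wolfe_reference_action(actions, q_distortion, budget, step=0.1):
    """actions: (B, N) CTU-level bit allocations. The feasible set is
    taken here to be the simplex {a >= 0, sum(a) = budget}, a stand-in
    for the set the rate critic would predict."""
    actions = actions.detach().requires_grad_(True)
    grad = torch.autograd.grad(q_distortion(actions).sum(), actions)[0]
    # Frank-Wolfe linear maximization over the simplex: the maximizer
    # puts the whole budget on the coordinate with the largest gradient.
    vertex = torch.zeros_like(actions)
    vertex.scatter_(1, grad.argmax(dim=1, keepdim=True), budget)
    return (actions + step * (vertex - actions)).detach()

q = lambda a: -((a - 1.0) ** 2).sum(dim=1)   # dummy distortion critic
ref = frank_wolfe_reference_action(torch.full((2, 4), 0.25), q, budget=1.0)
# The actor is then regressed toward `ref`, so its outputs stay feasible
# without clipping or constraint-violation penalties.
```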
This work introduces a B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. Learned B-frame coding is less explored and more challenging. Motivated by recent advances in conditional P-frame coding, B-CANF makes the first attempt to apply flow-based models to both conditional motion and inter-frame coding. B-CANF features frame-type adaptive coding, which learns better bit allocation for hierarchical B-frame coding. B-CANF also introduces a special type of B-frame, called the B*-frame, to mimic P-frame coding. On commonly used datasets, B-CANF achieves state-of-the-art compression performance, showing BD-rate results (in terms of PSNR-RGB) comparable to HM-16.23 under the random access configuration.
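For readers unfamiliar with hierarchical B-frame coding, the small helper below lists the dyadic coding order and the level of each B-frame in a GOP, which is the structure that frame-type adaptive coding allocates bits across. This is generic background, not code from the paper.

```python
# Helper illustrating hierarchical (dyadic) B-frame coding order.
def hierarchical_order(left: int, right: int, level: int = 1):
    """Yields (frame_index, level) for the B-frames between two already
    coded reference frames `left` and `right` (exclusive)."""
    if right - left < 2:
        return
    mid = (left + right) // 2
    yield mid, level                      # coarser levels are coded first
    yield from hierarchical_order(left, mid, level + 1)
    yield from hierarchical_order(mid, right, level + 1)

# GOP of 8: I/P frames at 0 and 8, then B-frames at 4, 2, 1, 3, 6, 5, 7.
print(list(hierarchical_order(0, 8)))
```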
This paper presents an end-to-end learned video compression system, termed CANF-VC, based on conditional augmented normalizing flows (ANF). Most learned video compression systems adopt the same hybrid-based coding architecture as traditional codecs. Recent research on conditional coding has shown the sub-optimality of hybrid-based coding and opened up opportunities for deep generative models to play a key role in creating new coding frameworks. CANF-VC represents a new attempt that leverages conditional ANF to learn a video generative model for conditional inter-frame coding. We choose ANF because it is a special type of generative model that includes the variational autoencoder as a special case and is capable of achieving better expressiveness. CANF-VC also extends the idea of conditional coding to motion coding, forming a purely conditional coding framework. Extensive experimental results on commonly used datasets confirm the superiority of CANF-VC over the state-of-the-art methods.
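The contrast with hybrid (residual) coding can be sketched with two toy PyTorch codecs: one subtracts the temporal prediction before coding, the other feeds it into the transforms as a condition. This only illustrates the conditioning idea; CANF-VC's actual codec is a conditional augmented normalizing flow, not a plain autoencoder.

```python
# Toy contrast between residual and conditional inter-frame coding.
import torch
import torch.nn as nn

class ResidualCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 8, 5, stride=2, padding=2)
        self.dec = nn.ConvTranspose2d(8, 3, 5, stride=2, padding=2,
                                      output_padding=1)
    def forward(self, x, x_cond):
        # Hybrid coding: the condition enters only as a subtraction.
        return x_cond + self.dec(self.enc(x - x_cond))

class ConditionalCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(6, 8, 5, stride=2, padding=2)       # sees x and condition
        self.cond_enc = nn.Conv2d(3, 8, 5, stride=2, padding=2)  # condition for decoder
        self.dec = nn.ConvTranspose2d(16, 3, 5, stride=2, padding=2,
                                      output_padding=1)
    def forward(self, x, x_cond):
        # Conditional coding: both transforms see the condition, so the
        # codec can exploit it more flexibly than a fixed subtraction.
        z = self.enc(torch.cat([x, x_cond], dim=1))
        return self.dec(torch.cat([z, self.cond_enc(x_cond)], dim=1))

x, x_cond = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(ResidualCodec()(x, x_cond).shape, ConditionalCodec()(x, x_cond).shape)
```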
In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, the goal is to synthesize a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to maintain a good trade-off between content details and style features: stylizing an image with sufficient style patterns may damage the content details, sometimes to the point where objects in the image can no longer be clearly distinguished. For this reason, we present STT, a new transformer-based method for image style transfer, together with an edge loss that noticeably enhances content details and avoids the blurred results caused by excessive rendering of style features. Qualitative and quantitative experiments demonstrate that STT achieves performance comparable to state-of-the-art image style transfer methods while alleviating the content leak problem.
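A plausible form of such an edge loss, assuming Sobel edge maps and an L1 penalty (the paper's exact formulation may differ):

```python
# Hypothetical edge loss: compare Sobel edge maps of the content and
# stylized images so heavy stylization cannot erase object contours.
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """img: (B, C, H, W). Returns per-pixel gradient magnitude."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    weight = torch.stack([kx, kx.t()]).unsqueeze(1).to(img)  # (2, 1, 3, 3)
    gray = img.mean(dim=1, keepdim=True)                     # luminance proxy
    g = F.conv2d(gray, weight, padding=1)                    # (B, 2, H, W)
    return torch.sqrt((g ** 2).sum(dim=1) + 1e-8)

def edge_loss(stylized, content):
    return F.l1_loss(sobel_edges(stylized), sobel_edges(content))

loss = edge_loss(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```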
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and show that MSTAT achieves state-of-the-art accuracies on various standard benchmarks.
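A minimal sketch of what temporal patch shuffling could look like, assuming each spatial patch is independently permuted along the temporal axis; the paper's exact variant may differ.

```python
# Hypothetical temporal patch shuffling augmentation.
import torch

def temporal_patch_shuffle(clip, patch_size: int, prob: float = 0.5):
    """clip: (T, C, H, W). Each spatial patch is given a random temporal
    permutation with probability `prob`."""
    T, C, H, W = clip.shape
    out = clip.clone()
    for y in range(0, H, patch_size):
        for x in range(0, W, patch_size):
            if torch.rand(()) < prob:
                perm = torch.randperm(T)
                out[:, :, y:y+patch_size, x:x+patch_size] = \
                    clip[perm, :, y:y+patch_size, x:x+patch_size]
    return out

aug = temporal_patch_shuffle(torch.randn(8, 3, 64, 32), patch_size=16)
```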
Machine learning models are typically evaluated by computing their similarity with reference annotations and trained by maximizing that similarity. Especially in the bio-medical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect the annotating entity's interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of Peak Ground Truth (PGT) is introduced. PGT marks the point beyond which an increase in similarity with the reference annotation stops translating into better Real World Model Performance (RWMP). Additionally, a quantitative technique is proposed to approximate PGT by computing inter- and intra-rater reliability. Finally, three categories of PGT-aware strategies for evaluating and improving model performance are reviewed.
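A toy numerical illustration of the PGT idea, under the assumption that agreement is measured by pairwise Dice between raters (the paper may use other reliability measures):

```python
# Toy PGT illustration: once a model's similarity to the annotators
# reaches the level at which the annotators agree with each other,
# further gains may no longer reflect real-world improvement.
import itertools
import numpy as np

def dice(a, b):
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum() + 1e-8)

def inter_rater_ceiling(annotations):
    """annotations: binary masks of the same image by different raters.
    Mean pairwise Dice approximates the agreement ceiling."""
    scores = [dice(a, b) for a, b in itertools.combinations(annotations, 2)]
    return float(np.mean(scores))

raters = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
model_pred = np.random.rand(64, 64) > 0.5
ceiling = inter_rater_ceiling(raters)
model_score = np.mean([dice(model_pred, r) for r in raters])
print(f"model {model_score:.3f} vs rater agreement {ceiling:.3f}")
# Pushing model_score above `ceiling` is the regime PGT warns about.
```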
We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but also low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach over existing pre-training methods.
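A heavily simplified sketch of the pre-training loop, with a toy MLP encoder and a naive orthographic splatting renderer standing in for the paper's point-based neural renderer:

```python
# Simplified render-and-compare pre-training loop (illustrative stand-ins).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

def splat(points, feats, H=32, W=32):
    """Naive orthographic splatting; gradients flow through `feats` only."""
    u = ((points[:, 0] + 1) / 2 * (W - 1)).long().clamp(0, W - 1)
    v = ((points[:, 1] + 1) / 2 * (H - 1)).long().clamp(0, H - 1)
    img = torch.zeros(feats.shape[1], H, W)
    img[:, v, u] = feats.t()              # last write wins; no z-buffering
    return img

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(10):                        # stand-in for massive RGB-D data
    points = torch.rand(1024, 3) * 2 - 1   # (N, 3) in [-1, 1]
    real_image = torch.rand(3, 32, 32)     # paired RGB observation
    rendered = splat(points, encoder(points))
    loss = (rendered - real_image).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```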
Collaboration among industrial Internet of Things (IoT) devices and edge networks is essential to support computation-intensive deep neural network (DNN) inference services that require low delay and high accuracy. Sampling rate adaptation, which dynamically configures the sampling rates of industrial IoT devices according to network conditions, is key to minimizing the service delay. In this paper, we investigate the collaborative DNN inference problem in industrial IoT networks. To capture the channel variation and task arrival randomness, we formulate the problem as a constrained Markov decision process (CMDP). Specifically, sampling rate adaptation, inference task offloading, and edge computing resource allocation are jointly considered to minimize the average service delay while guaranteeing the long-term accuracy requirements of different inference services. Since the CMDP cannot be directly solved by general reinforcement learning (RL) algorithms due to its intractable long-term constraints, we first transform it into an MDP by leveraging the Lyapunov optimization technique. Then, a deep RL-based algorithm is proposed to solve the MDP. To expedite the training process, an optimization subroutine is embedded in the proposed algorithm to directly obtain the optimal edge computing resource allocation. Extensive simulation results demonstrate that the proposed RL-based algorithm can significantly reduce the average service delay while preserving the long-term inference accuracy with high probability.
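The Lyapunov transformation can be sketched in a few lines: a virtual queue tracks accumulated accuracy violations, and the per-step reward takes the usual drift-plus-penalty form. Symbols and values below are illustrative, not the paper's notation.

```python
# Generic drift-plus-penalty form of the Lyapunov trick (illustrative).
def lyapunov_step(queue, achieved_acc, required_acc, delay, V=10.0):
    """Maintains a virtual queue for the long-term accuracy constraint
    and returns a per-step reward the RL agent can maximize.

    The queue grows when the accuracy requirement is violated, so
    minimizing drift-plus-penalty enforces the constraint on average."""
    reward = -(V * delay + queue * (required_acc - achieved_acc))
    queue = max(queue + required_acc - achieved_acc, 0.0)
    return reward, queue

queue = 0.0
for achieved in [0.90, 0.85, 0.95]:        # per-step inference accuracy
    r, queue = lyapunov_step(queue, achieved, required_acc=0.9, delay=0.02)
    print(f"reward={r:.3f}  virtual queue={queue:.3f}")
```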
Traditional statistical inference is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of that quantity. In some sequential estimation problems, however, the future values of the quantity to be estimated depend on the estimate of its current value. This type of estimation problem has been formulated as the dynamic inference problem. In this work, we formulate the Bayesian learning problem for dynamic inference, where the unknown quantity-generation model is assumed to be randomly drawn according to a random model parameter. We derive the optimal Bayesian learning rules, both offline and online, to minimize the inference loss. Moreover, learning for dynamic inference can serve as a meta problem, such that all familiar machine learning problems, including supervised learning, imitation learning, and reinforcement learning, can be cast as its special cases or variants. Gaining a good understanding of this unifying meta problem thus sheds light on a broad spectrum of machine learning problems.
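A toy discrete-parameter instance of online Bayesian learning for dynamic inference, written to my reading of the abstract (the generation model and estimator below are illustrative assumptions):

```python
# Toy dynamic inference: the next value depends on the current estimate,
# and the generation model's parameter is updated by Bayes' rule.
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.2, 0.5, 0.8])     # candidate model parameters
posterior = np.ones(3) / 3             # prior over the unknown parameter
true_theta, sigma = 0.8, 0.1
x_hat = 0.0                             # current estimate

for t in range(20):
    # The quantity's next value depends on the current estimate x_hat.
    x = true_theta + 0.5 * x_hat + rng.normal(0.0, sigma)
    # Online Bayesian update of the parameter posterior given x.
    lik = np.exp(-0.5 * ((x - (thetas + 0.5 * x_hat)) / sigma) ** 2)
    posterior = posterior * lik
    posterior /= posterior.sum()
    # Posterior-mean prediction becomes the estimate driving the next step.
    x_hat = float(posterior @ thetas) + 0.5 * x_hat

print("posterior over theta:", posterior.round(3))
```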